Sync master with upstream release b5280 #77

jan-service-account · 2025-05-05T00:09:13Z

Updates dev branch with latest release (b5280) from ggml-org/llama.cpp

* whisper: suppress Windows compiler warnings This commit disables compiler warnings on Windows when using MSVC. The motivation for these changes is that some compilers, for example Windows MSVC, generate warnings for these conversions, and there are quite a few of them. This makes it difficult to spot new warnings that may be introduced, and it is also a problem for users/embedders of ggml, who have a hard time separating these warnings from their own. * squash! whisper: suppress Windows compiler warnings Move ggml related warnings into ggml. This commit also fixes the indentation and adds a missing whitespace to the if statement.

This commit adds a check to make sure that the target exists before trying to add compile options to ignore warnings when using MSVC. The motivation for this is that the build is currently broken depending on the cmake options provided. With this fix it should be possible to build even if the targets are not actually available. Refs: ggml-org/whisper.cpp#3090 (comment)

* vulkan: Add bfloat16 support This adds bfloat16 matrix multiply support based on VK_KHR_shader_bfloat16. The extension is required for coopmat multiply support, but matrix-vector multiply trivially promotes bf16 to fp32 and doesn't require the extension. The copy/get_rows shaders also don't require the extension. It's probably possible to fall back to non-coopmat and promote to fp32 when the extension isn't supported, but this change doesn't do that. The coopmat support also requires a glslc that supports the extension, which currently requires a custom build. * vulkan: Support bf16 tensors without the bf16 extension or coopmat support Compile a variant of the scalar mul_mm shader that will promote the bf16 values to float, and use that when either the bf16 extension or the coopmat extensions aren't available. * vulkan: bfloat16 fixes (really works without bfloat16 support now) * vulkan: fix spirv-val failure and reenable -O
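The "trivially promotes bf16 to fp32" remark above can be illustrated outside of any shader. The following is a minimal sketch in Python, not the Vulkan code from this change: it assumes only the standard bfloat16 layout (the upper 16 bits of an IEEE-754 binary32 value), and the function names are hypothetical.

```python
import struct


def bf16_bits_to_float(bits: int) -> float:
    """Promote a 16-bit bfloat16 pattern to a float by going through fp32."""
    # bf16 is the upper half of a binary32 value, so promotion is a
    # left shift by 16 with the low 16 mantissa bits filled with zeros.
    (value,) = struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))
    return value


def float_to_bf16_bits(value: float) -> int:
    """Truncate an fp32 value to bf16 (no rounding; illustration only)."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", value))
    return bits32 >> 16


def matvec_bf16(matrix_bits, vector):
    """Matrix-vector product with bf16 weights promoted before multiplying.

    Accumulation uses Python floats (double precision), which stands in
    for the fp32 accumulation a shader would use.
    """
    return [
        sum(bf16_bits_to_float(w) * x for w, x in zip(row, vector))
        for row in matrix_bits
    ]


# 0x3F80 encodes 1.0 and 0x4000 encodes 2.0 in bfloat16.
result = matvec_bf16([[0x3F80, 0x4000]], [3.0, 4.0])
```

Because the promotion is a pure bit shift, no dedicated bf16 arithmetic is needed for this path, which is consistent with the commit's note that matrix-vector multiply does not require the extension.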

The following scenario will cause an assertion failure in the graph allocator:

- Build and allocate a graph containing a tensor with a non-NULL data pointer
- Build and allocate a new graph where that data is NULL

Result: ggml-alloc.c:819: GGML_ASSERT(talloc->buffer_id >= 0) failed

This happens during revalidation because we think that memory should have been previously allocated based on the current graph but in reality the previous graph was different. In this situation, we should do a full reallocation pass.
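The fallback described above can be modelled with a toy allocator. This is a hedged sketch in Python and not the ggml-alloc.c implementation: `Tensor`, `ToyGraphAllocator`, and all method names are hypothetical, and the model tracks only byte offsets per tensor name.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Tensor:
    name: str
    size: int
    data: Optional[int] = None  # externally provided address, or None


class ToyGraphAllocator:
    """Toy model: remembers offsets assigned while placing the previous graph."""

    def __init__(self) -> None:
        self.offsets: Dict[str, int] = {}
        self.full_reallocations = 0

    def _needs_realloc(self, graph: List[Tensor]) -> bool:
        # A tensor that must be placed now but has no recorded offset means
        # the previous graph was different (for example, the tensor used to
        # carry its own data pointer). Asserting here would crash; instead
        # the caller falls back to a full reallocation pass.
        return any(t.data is None and t.name not in self.offsets for t in graph)

    def _reserve(self, graph: List[Tensor]) -> None:
        # Full reallocation pass: discard old offsets and lay out again.
        self.offsets = {}
        offset = 0
        for t in graph:
            if t.data is None:
                self.offsets[t.name] = offset
                offset += t.size
        self.full_reallocations += 1

    def alloc_graph(self, graph: List[Tensor]) -> Dict[str, int]:
        if self._needs_realloc(graph):
            self._reserve(graph)
        return {
            t.name: (t.data if t.data is not None else self.offsets[t.name])
            for t in graph
        }


allocator = ToyGraphAllocator()
# First graph: "x" already has a non-NULL data pointer.
first = allocator.alloc_graph([Tensor("x", 16, data=0x1000), Tensor("y", 32)])
# Second graph: the same tensor now has no data, so it must be placed.
second = allocator.alloc_graph([Tensor("x", 16), Tensor("y", 32)])
```

In this model the second call finds no recorded offset for `x` and triggers a second full layout pass rather than failing, which mirrors the behaviour the commit message asks for.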

…13244) * fix out_of_range error to keep the chat loop running * Update examples/llava/mtmd-cli.cpp Co-authored-by: Sigbjørn Skjæret <[email protected]> * mtmd-cli : load image right away * add a new line for readability * rm printf * Update examples/llava/mtmd-cli.cpp Co-authored-by: Sigbjørn Skjæret <[email protected]> * Update examples/llava/mtmd-cli.cpp --------- Co-authored-by: Sigbjørn Skjæret <[email protected]> Co-authored-by: Xuan Son Nguyen <[email protected]> Co-authored-by: Xuan-Son Nguyen <[email protected]>

* kv-cache : separate recurrent vs non-recurrent impl (wip) ggml-ci * kv-cache : init -> constructor + add llama_memory_params ggml-ci * kv-cache : fix callback reference ggml-ci * context : llama_kv_cache -> llama_memory_i ggml-ci * context : move memory creation logic to model ggml-ci * llama : remove reference of memory during encode ggml-ci * kv-cache : hide padding details in the implementation ggml-ci * kv-cache : add ubatch_next() ggml-ci * context : simplify sbatch logic ggml-ci * kv-cache : hide defrag logic in the implementation ggml-ci * context : hide kv cache details in implementation ggml-ci * build : fix ggml-ci * cont : another fix ggml-ci * kv-cache : simplify interface (wip) ggml-ci * kv-cache : use separate KV cell structs for unified/recurrent ggml-ci * kv-cache : clean-up ggml-ci * model : better llama_model::create_model() signature ggml-ci * kv-cache : fix recurrent seq_rm() ggml-ci * kv-cache : replace `struct callbacks` with `llama_model &` ggml-ci * kv-cache : replace `struct graph_params` with `llama_context &` ggml-ci * kv-cache : fix offload check ggml-ci * context : avoid passing unique_ptr ggml-ci * kv-cache : avoid using the backends from the llama_context ref ggml-org#13113 ggml-ci * kv-cache : more consistent debug logs [no ci] * kv-cache : do not pass the full llama_context for kv graphs ggml-ci * kv-cache : remove comment * kv-cache : ggml_rope_ext_inplace -> ggml_rope_ext ggml-ci * kv-cache : fix recurrent multi-user case ggml-ci * memory : remove comments [no ci]

…gml-org#13209) * wip * qwen2.5vl ok * vision: fix models missing "text_config" * add test * fix test repo name * fix 32B model * Revert "fix 32B model" This reverts commit 651752f. * clarify about 32B * rm qwen surgery script * update llava/readme * move V_ENC_EMBD_PATCH handling to Qwen2VLVisionModel

This patch upstreams llamafile's CPU matrix multiplication kernels for ppc64le, using MMA builtins for the BF16 data type. This change results in 9x - 40x gains in total speed S t/s (i.e., all tokens / total time) across various batch sizes tested using the llama-batched-bench benchmark. The patch is tested with Meta-Llama-3-8B and Mistral-7B models (BF16 models generated by using llama-quantize from corresponding FP32 models) on an IBM POWER10 machine. Signed-off-by: Shalini Salomi Bodapati <[email protected]>
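The "total speed" metric quoted above is all tokens divided by total time, and the gain is the ratio of two such speeds. A minimal sketch of that arithmetic in Python follows; the function names are hypothetical and the token counts and timings are made-up illustrative values, not measurements from this patch.

```python
def total_speed(total_tokens: int, total_seconds: float) -> float:
    """Total speed in tokens per second: all tokens / total time."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return total_tokens / total_seconds


def speedup(baseline_tps: float, patched_tps: float) -> float:
    """Gain factor of the patched build over the baseline build."""
    return patched_tps / baseline_tps


# Illustrative numbers only: 1024 tokens processed in 8.0 s vs 0.5 s.
baseline = total_speed(1024, 8.0)   # 128.0 t/s
patched = total_speed(1024, 0.5)    # 2048.0 t/s
gain = speedup(baseline, patched)   # 16.0
```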

* vulkan : kernels for depthwise 2D convolution (CONV_2D_DW) (ggml/1204) * vulkan : add kernels for depthwise 2d convolution (OP_CONV_2D_DW) * review: remove src_x/y < 0 checks; add performance tests * sync : ggml ggml-ci * vulkan : fix lint (#0) --------- Co-authored-by: Acly <[email protected]>

* init * wip * working version * add mtmd::bitmaps * add test target * rm redundant define * test: mtmd_input_chunks_free * rm outdated comment * fix merging issue * explicitly create mtmd::input_chunks * mtmd_input_chunk_copy * add clone() * add const to various places * add warning about breaking changes * helper: use mtmd_image_tokens_get_n_pos

JohannesGaessler and others added 30 commits April 29, 2025 23:32

scripts: n_depth for compare-llama-bench [no ci] (ggml-org#13201)

19e899c

rpc : fix cache directory initialization (ggml-org#13188)

a0f7016

Signed-off-by: xiaofei <[email protected]>

docker : do not build tests (ggml-org#13204)

da84c04

* docker : do not build tests * include "ggml-cpu.h"

arg : allow using -hf offline (ggml-org#13202)

5933e6f

* arg : allow using -hf offline * add more comments in code [no ci]

feat(ggml-cpu): enable z17 compile (ggml-org#13182)

44cd8d9

z17 compilation requires GCC 15.1.0 and onwards Signed-off-by: Aaron Teo <[email protected]>

convert : correct typo image_mean --> image_std (ggml-org#13208)

07c2e2f

ggml : fix ppc64le build (ggml-org#13176)

4163137

Build fails with compilation error on power pc. This patch fixes the same. Tested with unit tests run via --build <build_dir> && cd <build_dir> && make test Signed-off-by: Shalini Salomi Bodapati <[email protected]>

vulkan: use uint array index to avoid glslang bug (ggml-org#13193)

e5007a5

common : add -jf / --json-schema-file flag (ggml-org#12011)

3b127c7

llava : remove duplicate include (ggml-org#13207)

ceda28e

convert : improve model arch handling (ggml-org#13122)

3e168be

* convert : improve model arch handling * use AutoConfig * rm trust_remote_code * Update convert_hf_to_gguf.py * fix self.block_count for vision * fix NomicBertModel

fix typo: n_ctx_pre_seq -> n_ctx_per_seq (ggml-org#13221)

16a457f

arg : -hf do not fail if url mismatch (ggml-org#13219)

6f67cf1

* arg : -hf do not fail if url mismatch * do not return if cannot parse metadata json

CUDA: batched+noncont MMQ, refactor bs>1 MoE code (ggml-org#13199)

e1e8e09

cuda : fix unused variable compile warning (whisper/0)

9998540

ggml-ci

ggml : fix ggml_gallocr_ptr type (ggml/1205)

4254bb4

sync : ggml

8d33d74

llama-model : fix the reported size class for nomic-embed-text-v2-moe (…

a70183e

…ggml-org#13223)

arg : remove CURLINFO_EFFECTIVE_METHOD (ggml-org#13228)

13c9a33

mtmd : add **vision** support for Mistral Small 3.1 (ggml-org#13231)

8936784

* convert ok * load ok, missing patch merger * ah sheet it works * update llava/readme * add test * fix test

sync : ggml

b1dd4d0

ggml-ci

test: non-cont. b in test-backend-ops -o MUL_MAT (ggml-org#13187)

b0ecbd4

vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul sha…

fc727bc

…der (ggml-org#13191) * vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul shader

llama-chat : update GLM4 chat template (ggml-org#13238)

e0f572c

* update GLM4 chat template * Update chat template Co-authored-by: Xuan-Son Nguyen <[email protected]> --------- Co-authored-by: Xuan-Son Nguyen <[email protected]>

clip : (minicpmv) Re-enable upscaling of images smaller than the CLIP…

b6e4ff6

… image size (ggml-org#13237)

build : fix build info on windows (ggml-org#13239)

d7a14c4

* build : fix build info on windows * fix cuda host compiler msg

justinsb and others added 28 commits May 1, 2025 23:32

rpc : avoid uninitialized memory in serialize_tensor (ggml-org#13210)

8efbdad

Zero out the name and padding buffers.

ci: fix cross-compile sync issues (ggml-org#12804)

d24d592

convert : explicitly disable trust_remote_code for AutoConfig (ggml-o…

dcf8860

…rg#13246)

server : add cache reuse card link to help (ggml-org#13230)

fab647e

* server : add cache reuse card link to help * args : use short url

llama-chat : reset glmedge chat template (ggml-org#13253)

2af6880

* reset glmedge chat template * fix glmedge chat template

llama : plamo rope type is neox (ggml-org#13260)

626083f

llama : orion rope type is neox (ggml-org#13261)

cb06a3c

convert : use correct context length for nomic-embed-text-v2 (ggml-or…

7d21234

…g#13216)

llama-model : support Qwen2 embedding models and pooling_mode_lasttok…

2f56761

…en (ggml-org#13245)

context : fix reorder logic (ggml-org#13267)

a75cb30

ggml-ci

llama : move end-user examples to tools directory (ggml-org#13249)

1d36b36

* llama : move end-user examples to tools directory --------- Co-authored-by: Xuan Son Nguyen <[email protected]>

llama : Llama-3_1-Nemotron-Ultra-253B-v1 support (ggml-org#12843)

3bf785f

clip : revert the change of BOI/EOI token for GLM-edge (⚠️ breaking c…

36667c8

…hange) (ggml-org#13259)

imatrix: fix oob writes if src1 is not contiguous (ggml-org#13286)

3e959f0

vulkan: Additional type support for unary, binary, and copy (ggml-org…

8ae5ebc

…#13266) Support f16->f32 copy. Support f16->f16 and f32->f32 unary ops. Support all combinations of f16/f32 for src0/src1/dst for add/sub/mul/div.

CUDA: fix race condition in MMQ ids_dst (ggml-org#13294)

8afbd96

CUDA: fix race condition in MMQ stream-k fixup (ggml-org#13299)

93c4e23

llama : build windows releases with dl backends (ggml-org#13220)

9f2da58

llava/mtmd : fixes to fully support dl backends (ggml-org#13303)

86bd60d

ggml : activate s390x simd for Q3_K (ggml-org#13301)

6eb7d25

Signed-off-by: Aaron Teo <[email protected]>

rpc : use backend registry, support dl backends (ggml-org#13304)

9fdfcda

Merge branch 'dev' into update-dev-from-master-2025-05-05-00-09

a483ab2

vansangpfiev closed this May 12, 2025

Minh141120 deleted the update-dev-from-master-2025-05-05-00-09 branch July 22, 2025 03:58

25 participants
